Author: “Purba Roy”


We will need the following R packages.

# Load standard libraries
library(tidyverse)
library(nycflights13)

Exploring the NYC Flights Data

Here, we will use the data on all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013. We find this data in the nycflights13 R package.

# The nycflights13 package (loaded above) includes data on all
# flights departing NYC
data(flights)
# Note the data itself is called flights, we will make it into a local
# tibble for readability (as_tibble() replaces the deprecated tbl_df())
flights <- as_tibble(flights)
# Look at the help file for information about the data
?flights
## starting httpd help server ... done
flights
#view(flights)
# summary(flights)
Importing and Inspecting the data

Inspecting the dataset shows that the data was collected from RITA, the Bureau of Transportation Statistics, and that it describes every flight that departed from the three New York City airports (JFK, LGA and EWR) in 2013. The dataset consists of 19 variables.

year, month and day give the date of departure. dep_time is the actual departure time in hour-minute format (HHMM or HMM), and sched_dep_time is the scheduled departure time in the same format. dep_delay is the difference between the actual and the scheduled departure time in minutes; negative values mean the flight left early. Similarly, arr_time, sched_arr_time and arr_delay are the arrival time, the scheduled arrival time and the difference between the two.

carrier is the two-letter abbreviation of the airline operating the flight. flight and tailnum are the flight number and the aircraft tail number. origin and dest are the departure and destination airports. air_time is the time spent in the air in minutes, and distance is the distance between origin and destination in miles. The scheduled departure time is also broken into two parts, stored in the hour and minute variables. The last variable, time_hour, gives the scheduled departure date and hour as a single date-time value.
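The HHMM encoding can be unpacked with integer division and the modulo operator. The sketch below uses the first few sched_dep_time values shown in the str() output of this report; the helper name hhmm_to_minutes is hypothetical and not part of nycflights13.

```r
# First few scheduled departure times from the data, stored as HHMM integers
sched_dep_time <- c(515L, 529L, 540L, 545L, 600L)

# Integer division by 100 gives the hour, the remainder gives the minute
hour   <- sched_dep_time %/% 100
minute <- sched_dep_time %% 100

# Hypothetical helper: minutes elapsed since midnight
hhmm_to_minutes <- function(hhmm) (hhmm %/% 100) * 60 + hhmm %% 100

data.frame(sched_dep_time, hour, minute,
           minutes_since_midnight = hhmm_to_minutes(sched_dep_time))
```

The resulting hour and minute values agree with the hour and minute columns of the dataset for these rows. Converting to minutes since midnight is one way to do arithmetic on clock times, since subtracting raw HHMM values gives wrong answers across hour boundaries.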

head(flights,9)

The head function shows the first nine rows of the data in tabular form. Printing the full tibble, as above, shows that 336776 flights in total departed from New York in 2013.

str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

I found that the variables year, month and day are stored as integers rather than as a date. The only variable stored as a date-time (POSIXct) is time_hour.
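If a proper date column is needed, the three integer columns can be combined in base R. This is a minimal sketch on made-up values with the same shape as year, month and day; the name flight_date is an assumption for illustration.

```r
# Toy values in the same shape as the year, month and day columns
year  <- c(2013L, 2013L, 2013L)
month <- c(1L, 7L, 12L)
day   <- c(1L, 4L, 31L)

# ISOdate() builds a date-time (noon GMT by default); as.Date() drops the time
flight_date <- as.Date(ISOdate(year, month, day))
flight_date
class(flight_date)

# Date arithmetic now works, e.g. the day of the year for each flight
as.integer(format(flight_date, "%j"))
```

With the real data, the same call could be applied to the flights columns to add a date column alongside time_hour.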

summary(flights)
##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##  sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124  
##  Median :1359   Median :  -2.00   Median :1535   Median :1556  
##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536  
##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945  
##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##                 NA's   :8255      NA's   :8713                 
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776     
##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character  
##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character  
##  Mean   :   6.895                      Mean   :1972                     
##  3rd Qu.:  14.000                      3rd Qu.:3465                     
##  Max.   :1272.000                      Max.   :8500                     
##  NA's   :9430                                                           
##     origin              dest              air_time        distance   
##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502  
##  Mode  :character   Mode  :character   Median :129.0   Median : 872  
##                                        Mean   :150.7   Mean   :1040  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##                                        NA's   :9430                  
##       hour           minute        time_hour                  
##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

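summary() gives the minimum, quartiles, mean and maximum of each numeric and date-time variable, plus the number of missing values; character columns only report their length. From it we find that the mean departure delay over the entire year was 12.64 minutes, while the median was -2 minutes, so at least half of the flights left ahead of schedule.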

Formulating questions on the NYC flights data
  • The first question I would ask is: in which month is the total delay (departure + arrival delay) the highest, and why? I find this interesting because it reveals how flight delays vary over the year. Does the weather during a particular month affect the delay, or is there some other cause? To understand the delay further: which carrier has the highest delay? Is a delay due to the carrier or to the weather conditions in a given month?

  • The second question I find interesting is the relation between origin airport and carrier. This will help me understand which airline operates most frequently from which airport in New York.

Exploring Data

For each of the questions we proposed above, we perform an exploratory data analysis designed to address the question.

For the first question, I plotted total delay (departure + arrival delay) against month. The smoothed curve shows that the delay is highest in July and dips substantially in October and November, meaning there were fewer delays during that period. So it is possible that seasonal weather, or conditions such as overbooking or extra flights for a particular occasion, affect the delay.

flights$totalDelay<-flights$dep_delay+flights$arr_delay

ggplot(data = flights)+ 
  geom_smooth(mapping = aes(x = month, y = totalDelay ), na.rm= TRUE)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

To examine the relation between carrier and delay, I drew a boxplot. However, the long tails of extreme delays stretch the scale, so the boxes are compressed and the graph gives little information about the median values.

ggplot(data = flights) +
  geom_boxplot(mapping = aes(x = carrier, y = totalDelay), na.rm=TRUE) 
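Since the boxplot hides the medians, a numeric summary per carrier can complement it. The sketch below is one possible way to do this with dplyr; the helper delay_by_carrier is hypothetical, and the toy data is made up apart from its first four rows, which mirror the first flights in the dataset.

```r
library(dplyr)

# Hypothetical helper: median and interquartile range of total delay per carrier
delay_by_carrier <- function(df) {
  df %>%
    mutate(totalDelay = dep_delay + arr_delay) %>%
    group_by(carrier) %>%
    summarise(
      median_delay = median(totalDelay, na.rm = TRUE),
      iqr_delay    = IQR(totalDelay, na.rm = TRUE),
      n            = n()
    ) %>%
    arrange(desc(median_delay))
}

# Toy rows: the first four mirror the head of flights, the last two are made up
toy <- tibble(
  carrier   = c("UA", "UA", "AA", "B6", "B6", "AA"),
  dep_delay = c(2, 4, 2, -1, NA, 10),
  arr_delay = c(11, 20, 33, -18, NA, 30)
)

delay_by_carrier(toy)
```

Calling delay_by_carrier(flights) would give one row per carrier, sorted from the highest to the lowest median total delay.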

For the second question, I decided to explore the relation between airlines and origin airport. I plotted one panel per carrier with the facet_wrap function to examine each relation individually. It showed me that at JFK, JetBlue (B6) is the most used airline compared to the other airlines.

ggplot(data=flights, aes(x=origin, group=carrier, fill=carrier)) +
    geom_density(adjust=1.5) +
    facet_wrap(~carrier) 

I then combined the individual panels into one plot to get a better comparative understanding of the relation. From the overlapping densities, I can posit that carriers depart from JFK with comparatively higher frequency.

ggplot(data=flights, aes(x=origin, group=carrier, fill=carrier)) +
    geom_density(adjust=1.5, alpha=.4) 
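Because origin is a categorical variable, a plain count of flights per origin and carrier is a more direct check of this relation than a density curve. The following is a sketch under that assumption, not the method used for the plots above; the helper flights_by_origin_carrier and the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: number of flights for each origin/carrier pair
flights_by_origin_carrier <- function(df) {
  df %>%
    count(origin, carrier, name = "n_flights") %>%
    arrange(origin, desc(n_flights))
}

# Toy rows: the first four mirror the head of flights, the last two are made up
toy <- tibble(
  origin  = c("EWR", "LGA", "JFK", "JFK", "JFK", "EWR"),
  carrier = c("UA",  "UA",  "AA",  "B6",  "B6",  "UA")
)

flights_by_origin_carrier(toy)
```

On the real data, flights_by_origin_carrier(flights) lists the busiest carrier first within each airport. The same counts can be drawn with ggplot(flights, aes(x = origin, fill = carrier)) + geom_bar(position = "dodge").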

Tackling generic questions on the flight dataset.

How many flights out of NYC are there in the data?

dim(flights)
## [1] 336776     20
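# Ans: 336776
# dim() gives the number of rows and columns. Each row is one flight, so the
# row count is the number of flights departing NYC. The 20 columns are the
# original 19 plus the totalDelay column added earlier.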

How many NYC airports are included in this data? Which airports are these?

length(unique(flights$origin))
## [1] 3
# Ans: 3 NYC airports are included in the data for departure: EWR, LGA and JFK.
# We used the unique function to get the distinct values of the departure airport.

Into how many airports did the airlines fly from NYC in 2013?

length(unique(flights$dest))
## [1] 105
# Ans: The airplanes flew into 105 airports. We used the unique function on the dest column to get the distinct destination airports.

How many flights were there from NYC to Seattle (airport code SEA)?

p <- dim(filter(flights, dest == "SEA"))
p
## [1] 3923   20
# Ans: there were 3923 flights to Seattle. We used the filter function to extract the matching rows.

Were there any flights from NYC to Spokane (GAG)?

GAG <- dim(filter (flights, dest == "GAG"))
GAG
## [1]  0 20
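# Ans: No, there weren't any; the filter returned 0 rows.
# Note that GAG is not Spokane's code (Spokane International is GEG), so the
# check should be repeated with dest == "GEG".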

Are there any destinations that do not look like valid airport codes (i.e. three letters, all upper case)?

# Character length of each destination code, to compare with 3
three <- nchar(flights$dest)

# TRUE where the code contains any lower-case letter
lower <- str_detect(flights$dest, "[:lower:]")

# TRUE where the code has only letters and no digits or other characters
charc <- grepl("^[A-Za-z]+$", flights$dest, perl = TRUE)

# A code is invalid if it is missing or fails ANY of the checks, so the
# conditions are combined with | (or). Combining them with & (and) can never
# match, because a missing code cannot also have a length different from 3.
DestinationCode <- filter(flights, is.na(dest) | three != 3 | lower | !charc)
DestinationCode
# ANS: 0 values with invalid airport codes
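To see that a rule of this kind actually catches bad codes, it can be tried on a handful of made-up values. The sketch below expresses "three letters, all upper case" as a single regular expression; the codes vector is invented for illustration and does not come from the flights data.

```r
library(stringr)

# Made-up codes: one valid, several invalid in different ways, one missing
codes <- c("SEA", "sea", "SE", "S3A", "PDXX", NA)

# Exactly three upper-case letters from start to end of the string
valid <- str_detect(codes, "^[A-Z]{3}$")

# Missing values count as invalid too
invalid <- codes[is.na(codes) | !valid]
invalid
```

Only "SEA" passes; the lower-case, too-short, digit-containing, too-long and missing codes are all flagged.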

What is the typical delay of flights in this data?

mean(flights$arr_delay[flights$arr_delay>0], na.rm=TRUE)
## [1] 40.3425
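# flights %>% summarise(mean(arr_delay, na.rm = TRUE))

# mean(flights$totalDelay[flights$totalDelay > 0], na.rm = TRUE)

# ANS: among flights that arrived late (arr_delay > 0), the typical (mean)
# arrival delay is 40.3425 minutes. Over all flights the mean arrival delay
# is 6.895 minutes and the median is -5 minutes (see summary(flights) above).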

Which ones are the worst three destinations from NYC if we don’t like flight delays?

#sort(flights$arr_delay,decreasing=TRUE)

flights[order(flights$arr_delay, decreasing = TRUE),c("arr_delay","dest")]
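# Ans: HNL, CMH, ORD. These are the destinations of the three individual
# flights with the longest arrival delays, not the destinations with the
# highest average delay.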
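An alternative reading of "worst destinations" is the highest average arrival delay per destination. The sketch below shows that approach with dplyr; the helper worst_destinations and its parameters top and min_flights are hypothetical, and the toy data is made up apart from its first four rows, which mirror the first flights in the dataset.

```r
library(dplyr)

# Hypothetical helper: destinations ranked by mean arrival delay
worst_destinations <- function(df, top = 3, min_flights = 1) {
  df %>%
    group_by(dest) %>%
    summarise(
      mean_arr_delay = mean(arr_delay, na.rm = TRUE),
      n_flights      = n()
    ) %>%
    filter(n_flights >= min_flights) %>%   # optionally skip rarely served airports
    arrange(desc(mean_arr_delay)) %>%
    head(top)
}

# Toy rows: the first four mirror the head of flights, the rest are made up
toy <- tibble(
  dest      = c("IAH", "IAH", "MIA", "BQN", "MIA", "BQN", "ORD"),
  arr_delay = c(11, 20, 33, -18, 7, NA, 5)
)

worst_destinations(toy, top = 2)
```

On the real data, worst_destinations(flights) may give a different answer from the one above, because a single extreme flight does not dominate an average the way it dominates a sort of individual flights.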

How many flights were there from NYC airports to Portland in 2013?

p <- dim(filter(flights, dest == "PDX"))
p
## [1] 1354   20
# Ans: 1354 flights

How many airlines fly from NYC to Portland?

#grp_by <- group_by(flights,carrier)

unique(flights$carrier[flights$dest=="PDX"])
## [1] "DL" "UA" "B6"

Which are these airlines (find the 2-letter abbreviations)? How many times did each of these go to Portland?

# Count flights for every carrier/destination pair
gr <- group_by(flights, carrier, dest)
gr
p <- summarise(gr, count = n())
p
# Delta (DL) flights to Portland
newdata <- flights[which(flights$dest == "PDX" & flights$carrier == "DL"), ]
dim(newdata)
## [1] 458  20
# Ans: 458

# United (UA) flights to Portland
newdata <- flights[which(flights$dest == "PDX" & flights$carrier == "UA"), ]
dim(newdata)
## [1] 571  20
# Ans: 571

# JetBlue (B6) flights to Portland
newdata <- flights[which(flights$dest == "PDX" & flights$carrier == "B6"), ]
dim(newdata)
## [1] 325  20
# Ans: 325
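The three separate subsets can also be obtained in one step by filtering on the destination and counting per carrier. This is a sketch of that idea; the helper carrier_counts_to and the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: flights per carrier into a given destination airport
carrier_counts_to <- function(df, airport) {
  df %>%
    filter(dest == airport) %>%
    count(carrier, sort = TRUE)   # sort = TRUE puts the largest count first
}

# Made-up rows for illustration
toy <- tibble(
  dest    = c("PDX", "PDX", "PDX", "SEA", "PDX"),
  carrier = c("DL",  "UA",  "UA",  "DL",  "B6")
)

carrier_counts_to(toy, "PDX")
```

On the real data, carrier_counts_to(flights, "PDX") should return one row each for DL, UA and B6 with the counts found above.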

How many different airplanes arrived from each of the three NYC airports to Portland?

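# The three NYC origin airports
p <- unique(flights$origin)
p
## [1] "EWR" "LGA" "JFK"
# Distinct tail numbers on flights from JFK to PDX
p <- unique(flights$tailnum[flights$dest == "PDX" & flights$origin == "JFK"])
length(p)
## [1] 195
# Distinct tail numbers on flights from EWR to PDX
p <- unique(flights$tailnum[flights$dest == "PDX" & flights$origin == "EWR"])
length(p)
## [1] 297
# Distinct tail numbers on flights from LGA to PDX
p <- unique(flights$tailnum[flights$dest == "PDX" & flights$origin == "LGA"])
length(p)
## [1] 0
# Ans: 195 airplanes from JFK, 297 from EWR and none from LGA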
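The per-airport counts can also be computed in a single grouped summary. The sketch below assumes that counting distinct tail numbers is the intended measure of "different airplanes"; the helper planes_by_origin and the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: distinct tail numbers per origin for one destination
planes_by_origin <- function(df, airport) {
  df %>%
    filter(dest == airport) %>%
    group_by(origin) %>%
    summarise(n_planes = n_distinct(tailnum))
}

# Made-up rows for illustration
toy <- tibble(
  origin  = c("JFK", "JFK", "EWR", "EWR", "EWR", "LGA"),
  dest    = c("PDX", "PDX", "PDX", "PDX", "PDX", "SEA"),
  tailnum = c("N1",  "N1",  "N2",  "N3",  "N2",  "N4")
)

planes_by_origin(toy, "PDX")
```

An origin with no flights to the destination, such as LGA in the toy data, simply does not appear in the result. Like unique(), n_distinct() counts a missing tail number as one extra value unless na.rm = TRUE is passed.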

What percentage of flights to Portland were delayed at departure by more than 15 minutes?

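# Flights to Portland that left more than 15 minutes late
p <- filter(flights, dep_delay > 15, dest == "PDX")
count(p)
# All flights to Portland
l <- filter(flights, dest == "PDX")
count(l)
# Share of late departures, in percent
count(p) / count(l) * 100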
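The same percentage can be computed in one pipeline that returns a plain number. The sketch below keeps flights with a missing dep_delay in the denominator, as the calculation above does; the helper pct_late_departures, its threshold parameter and the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: percentage of flights to an airport that departed
# more than `threshold` minutes late (missing delays stay in the denominator)
pct_late_departures <- function(df, airport, threshold = 15) {
  df %>%
    filter(dest == airport) %>%
    summarise(pct_late = 100 * sum(dep_delay > threshold, na.rm = TRUE) / n()) %>%
    pull(pct_late)
}

# Made-up rows for illustration
toy <- tibble(
  dest      = c("PDX", "PDX", "PDX", "PDX", "SEA"),
  dep_delay = c(20, 5, NA, 16, 100)
)

pct_late_departures(toy, "PDX")
```

In the toy data, two of the four Portland flights left more than 15 minutes late. To exclude flights with no recorded departure delay instead, the denominator could be changed to sum(!is.na(dep_delay)).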

Seasonal Delays. Let's check the seasonal delays for the flights dataset.

# Graphical:

ggplot(data = flights) + 
  geom_smooth(mapping = aes(x = month, y = arr_delay, colour=origin))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).

ggplot(data = flights) +
  geom_histogram(mapping = aes(x = month), binwidth = 0.5)

#Tabular:
grp_by <- group_by(flights,month)

#summarise(grp_by,delay = mean(dep_delay, na.rm = TRUE))


# binwidth is a parameter of the geom, not an aesthetic, so it goes outside aes()
ggplot(data = flights) +
  geom_histogram(mapping = aes(x = month), binwidth = 0.1) +
  geom_smooth(mapping = aes(x = month, y = arr_delay))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9430 rows containing non-finite values (stat_smooth).

flights %>% 
  count(month)
head(flights,100)
summary(flights[order(flights$arr_delay, decreasing = TRUE),c("arr_delay","dest","month")])
##    arr_delay            dest               month       
##  Min.   : -86.000   Length:336776      Min.   : 1.000  
##  1st Qu.: -17.000   Class :character   1st Qu.: 4.000  
##  Median :  -5.000   Mode  :character   Median : 7.000  
##  Mean   :   6.895                      Mean   : 6.549  
##  3rd Qu.:  14.000                      3rd Qu.:10.000  
##  Max.   :1272.000                      Max.   :12.000  
##  NA's   :9430
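The tabular view of seasonal delays, started above with group_by(flights, month), could be completed with a per-month summary. The following is a sketch of one way to do that; the helper monthly_delays and the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: flight count and mean delays per month,
# sorted from the worst to the best month by mean arrival delay
monthly_delays <- function(df) {
  df %>%
    group_by(month) %>%
    summarise(
      n_flights      = n(),
      mean_dep_delay = mean(dep_delay, na.rm = TRUE),
      mean_arr_delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
    arrange(desc(mean_arr_delay))
}

# Made-up rows for illustration
toy <- tibble(
  month     = c(1L, 1L, 7L, 7L, 10L),
  dep_delay = c(2, 4, 30, NA, -5),
  arr_delay = c(11, 20, 40, NA, -10)
)

monthly_delays(toy)
```

Calling monthly_delays(flights) gives twelve rows, which makes it possible to check numerically the months that the smoothed curves suggest are the best and the worst.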